Automatic Identification of Non-compositional Phrases
Author
Abstract
Non-compositional expressions present a special challenge to NLP applications. We present a method for automatic identification of non-compositional expressions using their statistical properties in a text corpus. Our method is based on the hypothesis that when a phrase is non-compositional, its mutual information differs significantly from the mutual information of phrases obtained by substituting one of the words in the phrase with a similar word.

1 Introduction

Non-compositional expressions present a special challenge to NLP applications. In machine translation, word-for-word translation of non-compositional expressions can result in very misleading (sometimes laughable) translations. In information retrieval, expansion of words in a non-compositional expression can lead to a dramatic decrease in precision without any gain in recall.

Less obviously, non-compositional expressions need to be treated differently from other phrases in many statistical or corpus-based NLP methods. For example, an underlying assumption in some word sense disambiguation systems, e.g., (Dagan and Itai, 1994; Li et al., 1995; Lin, 1997), is that if two words occur in the same context, they are probably similar. Suppose we want to determine the intended meaning of "product" in "hot product". We can find other words that are also modified by "hot" (e.g., "hot car") and then choose the meaning of "product" that is most similar to the meanings of these words. However, this method fails when non-compositional expressions are involved. For instance, using the same algorithm to determine the meaning of "line" in "hot line", the words "product", "merchandise", "car", etc., would lead the algorithm to choose the "line of product" sense of "line".

We present a method for automatic identification of non-compositional expressions using their statistical properties in a text corpus.
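The substitution hypothesis stated in the abstract can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the counts are invented toy numbers, the thesaurus entries are hypothetical, plain pointwise mutual information over (modifier, head) pairs stands in for whatever mutual information measure the paper defines, and the fixed threshold is a crude substitute for a proper significance test.

```python
import math
from collections import Counter

# Invented toy co-occurrence counts for (modifier, head) pairs; not real corpus data.
TOY_COUNTS = Counter({
    ("hot", "line"): 30, ("hot", "car"): 10, ("hot", "truck"): 8, ("hot", "route"): 1,
    ("warm", "line"): 1, ("warm", "car"): 12, ("warm", "truck"): 10, ("warm", "route"): 8,
})

# Hypothetical thesaurus: each word maps to words assumed to be similar to it.
TOY_THESAURUS = {"hot": ["warm"], "line": ["route"], "car": ["truck"]}


def pmi(counts, pair):
    """Pointwise mutual information (base 2) of a pair, or None if the pair is unseen."""
    joint = counts[pair]
    if joint == 0:
        return None
    total = sum(counts.values())
    left = sum(c for (w1, _), c in counts.items() if w1 == pair[0])
    right = sum(c for (_, w2), c in counts.items() if w2 == pair[1])
    return math.log2(joint * total / (left * right))


def substitution_gaps(counts, pair, thesaurus):
    """PMI of the phrase minus PMI of each variant with one word replaced by a similar word."""
    base = pmi(counts, pair)
    variants = [(s, pair[1]) for s in thesaurus.get(pair[0], [])]
    variants += [(pair[0], s) for s in thesaurus.get(pair[1], [])]
    gaps = {}
    for variant in variants:
        score = pmi(counts, variant)
        if base is not None and score is not None:  # unseen variants are skipped
            gaps[variant] = base - score
    return gaps


def looks_non_compositional(counts, pair, thesaurus, threshold=2.0):
    """Flag a phrase when every observed variant differs in PMI by at least `threshold` bits."""
    gaps = substitution_gaps(counts, pair, thesaurus)
    return bool(gaps) and all(abs(g) >= threshold for g in gaps.values())
```

With these toy counts, "hot line" is flagged because both "warm line" and "hot route" have much lower mutual information, while "hot car" is not flagged because "warm car" and "hot truck" score close to it. A phrase with no observed variants is left unflagged, since the sketch has no evidence either way.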
The intuitive idea behind the method is that the metaphorical usage of a non-compositional expression causes it to have a different distributional characteristic than expressions that are similar to its literal meaning.

2 Input Data

The input to our algorithm is a collocation database and a thesaurus. We briefly describe the process of obtaining this input. More details about the construction of the collocation database and the thesaurus can be found in (Lin, 1998). We parsed a 125-million-word newspaper corpus with Minipar, a descendant of Principar (Lin, 1993; Lin, 1994), and extracted dependency relationships from the parsed corpus. A dependency relationship is a triple: (head type modifier), where head and modifier are words in the input sentence and type is the type of the dependency relation. For example, (1a) is an example dependency tree and the set of dependency triples extracted from (1a) is shown in (1b).
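The example tree and triples that the text refers to are not reproduced in this excerpt. As a minimal sketch of the triple format only, the following shows one way such triples might be stored and counted into a small collocation table. The triples and the relation labels ("mod", "subj") are made up for illustration and are not Minipar output or its label set; the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Triple(NamedTuple):
    """A dependency relationship in the (head type modifier) order described in the text."""
    head: str
    rel_type: str
    modifier: str


def build_collocation_db(triples: Iterable[Triple]) -> Counter:
    """Count how often each distinct triple occurs across the parsed corpus."""
    return Counter(triples)


def modifiers_of(db: Counter, head: str, rel_type: str) -> Dict[str, int]:
    """Return the modifiers seen with a given head and relation type, with their counts."""
    return {t.modifier: n for t, n in db.items()
            if t.head == head and t.rel_type == rel_type}


# Hand-written, hypothetical triples standing in for parser output.
PARSED = [
    Triple("line", "mod", "hot"),
    Triple("ring", "subj", "line"),
    Triple("line", "mod", "hot"),
    Triple("car", "mod", "hot"),
]

DB = build_collocation_db(PARSED)
```

Here `modifiers_of(DB, "line", "mod")` collects the words that modify "line", which is the kind of lookup a collocation database makes cheap. Keeping the relation type in the key means the same word pair under different relations is counted separately.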
Similar resources
When a Red Herring in Not a Red Herring: Using Compositional Methods to Detect Non-Compositional Phrases
Non-compositional phrases such as red herring and weakly compositional phrases such as spelling bee are an integral part of natural language (Sag et al., 2002). They are also the phrases that are difficult, or even impossible, for good compositional distributional models of semantics. Compositionality detection therefore provides a good testbed for compositional methods. We compare an integrate...
Relation Acquisition over Compositional Phrases
Relations, after morphemes and words, are the next level of building blocks of language. To successfully employ relations in language applications like unrestricted question answering, we must be able to acquire them automatically. I propose to take two new steps towards this goal: to combine existing relation learning algorithms in a single joint or simultaneous algorithm for higher accuracy, ...
Identification of Basic Phrases for Kazakh Language using Maximum Entropy Model
This paper proposes the definition, classification and structure of the Kazakh basic phrases, and sets up a framework for classifying them according to their syntactic functions. Meanwhile, the structure of the Kazakh basic phrases were analyzed; and the determination of the Kazakh basic phrases collocation and extraction of the Kazakh basic phrases based on rules were followed. The Maximum Ent...
Google Web 1T 5-Grams Made Easy (but not for the computer)
This paper introduces Web1T5-Easy, a simple indexing solution that allows interactive searches of the Web 1T 5-gram database and a derived database of quasi-collocations. The latter is validated against co-occurrence data from the BNC and ukWaC on the automatic identification of non-compositional VPC.
Cross-Lingual Variation of Light Verb Constructions: Using Parallel Corpora and Automatic Alignment for Linguistic Research
Cross-lingual parallelism and small-scale language variation have recently become subject of research in both computational and theoretical linguistics. In this article, we use a parallel corpus and an automatic aligner to study English light verb constructions and their German translations. We show that parallel corpus data can provide new empirical evidence for better understanding the proper...
Publication date: 1999